December 11, 2015
Raw data given as excel file with ~300 tabs, one for each patient, like this:
Along with timeseries measurements, the following static covariates were given for each patient:
As well as an outcome score called "GOS" measured at 3, 6, 12, and 24 months:
For the sake of modeling, the following assumptions were made:
There were 339 patients in raw data but only 268 in filtered dataset.
Note: Of 268 patients in filtered set, 18 had missing 3 month GOS
Frequency of non-time-dependent values:
Relationships between static variables and outcomes are pretty clear:
Timeseries measurements for 4 random patients:
Some observations on timeseries values:
Overall trends in measurements across all patients:
Things should level off by hour 48, validating the assumption that earlier measurements are more relevant.
A big challenge involves working around the fact that this model for the data below is not valid:
glm(outcome ~ age + pbto2, family='binomial')
| age | pbto2 | uid | tsi_min | outcome |
|---|---|---|---|---|
| 29 | 0 | 636 | 540 | 0 |
| 29 | 1 | 636 | 600 | 0 |
| 29 | 2 | 636 | 660 | 0 |
| 24 | 18 | 656 | 720 | 1 |
| 24 | 16 | 656 | 1080 | 1 |
Repeating the non-time-dependent variables like this leads to confidence intervals on coefficients that are unrealistically small (and not valid).
Possible Workarounds:
All of the models that follow take approach 3, though the way they deal with the thresholds will differ.
As a first attempt at modeling, all timeseries variables were separated into ranges based on known guidelines for safe values (Primarily from this Lab Value Guide).
For example, the PaCO2 variable was split into three new variables:
PaCO2 value in [0, 35)PaCO2 value in [35, 45)PaCO2 value in [45, +Inf)For all variables, the known "safe range" was then dropped from the model to avoid linear dependence (percentages across all variables would add up to 1). This leaves only variables indicating time spent at dangerous levels.
Modeling input dataset; includes static covariates and timeseries summaries:
## variable mean sd min max type ## 1 age 37.75 16.37 16 77.00 Static ## 2 gcs 5.90 1.46 3 8.00 Static ## 3 marshall 3.06 1.47 1 6.00 Static ## 4 sex 0.78 0.42 0 1.00 Static ## 5 icp1_20_inf 0.04 0.13 0 1.00 Timeseries ## 6 paco2_0_35 0.44 0.28 0 1.00 Timeseries ## 7 paco2_45_inf 0.05 0.12 0 0.80 Timeseries ## 8 pao2_0_30 0.04 0.09 0 0.43 Timeseries ## 9 pao2_100_inf 0.08 0.15 0 1.00 Timeseries ## 10 pbto2_0_20 0.31 0.31 0 1.00 Timeseries ## 11 pbto2_100_inf 0.00 0.02 0 0.19 Timeseries ## 12 pha_0_7.35 0.10 0.18 0 1.00 Timeseries ## 13 pha_7.45_inf 0.30 0.27 0 1.00 Timeseries
Best model selected through exhaustive AIC search:
glmulti('gos ~ .', family='binomial', level=1)
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -1.484 | 0.198 | -7.506 | 0.000 |
| age | -0.639 | 0.195 | -3.275 | 0.001 |
| gcs | 0.542 | 0.184 | 2.939 | 0.003 |
| paco2_45_inf | 0.509 | 0.175 | 2.916 | 0.004 |
| pao2_0_30 | -0.430 | 0.202 | -2.124 | 0.034 |
| icp1_20_inf | -0.621 | 0.303 | -2.049 | 0.040 |
| pbto2_0_20 | -0.347 | 0.182 | -1.909 | 0.056 |
| pbto2_100_inf | -0.777 | 0.432 | -1.797 | 0.072 |
| marshall | -0.327 | 0.197 | -1.658 | 0.097 |
| Estimate | Nb models | Importance | +/- (alpha=0.05) | |
|---|---|---|---|---|
| paco2_0_35 | -0.045 | 37.000 | 0.294 | 0.190 |
| pao2_100_inf | -0.054 | 37.000 | 0.325 | 0.204 |
| sex | 0.067 | 37.000 | 0.342 | 0.238 |
| pha_0_7.35 | -0.115 | 44.000 | 0.427 | 0.353 |
| pha_7.45_inf | -0.115 | 47.000 | 0.454 | 0.336 |
| marshall | -0.241 | 65.000 | 0.701 | 0.450 |
| pbto2_0_20 | -0.269 | 69.000 | 0.780 | 0.420 |
| pao2_0_30 | -0.423 | 94.000 | 0.969 | 0.416 |
| icp1_20_inf | -0.620 | 95.000 | 0.977 | 0.622 |
| pbto2_100_inf | -0.748 | 99.000 | 0.996 | 0.844 |
| (Intercept) | -1.486 | 100.000 | 1.000 | 0.394 |
| age | -0.690 | 100.000 | 1.000 | 0.397 |
| gcs | 0.551 | 100.000 | 1.000 | 0.367 |
| paco2_45_inf | 0.521 | 100.000 | 1.000 | 0.394 |
Plot of 95% Confidence Intervals over all models, sized by importance:
A potential improvement on the linear models would involve not needing to set "threshold" ranges for blood gas measurements manually. The most basic version of this kind of model would involve a single threshold and a step function where having blood gas values on one side of that function has a different effect that the other side.
A generalization of this could allow more than one threshold as well as the possibility for gradual shifts between regions of values where effects differ.
A modified logistic model:
\[ logit(y_i) = \alpha + \beta \cdot X_i + f(G_{ij}) \]
where
\[ X_i = [Gender_i, Age_i, CommaScore_i, MarshallScore_i] \]
and
\[ f(G_i) = \frac{1}{n_i} \sum_j{ \frac{c_1}{1 + e^{-c_2(G_{ij} - c_3)}} + \frac{c_4}{1 + e^{-c_5(G_{ij} - c_6)}} } \] \[ n_i = \text{ length of timeseries for patient }i \]
The "double logistic" function definition for \(f\) allows for flexiblity in thresholds and graduated changes.
These are functions drawn from the priors in the model and show all possibilities:
By semi-simulated, I mean by taking the real data and hard coding coefficient / function values.